-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP expand Variables class to handle s3 urls from NSIDC #434
Conversation
@@ -72,6 +74,14 @@ def __init__( | |||
elif self._vartype == "file": | |||
# DevGoal: check that the list or string are valid dir/files | |||
self.path = path | |||
elif self._vartype == "nsidc-s3": | |||
# Grab metadata from s3 path | |||
template = ('s3://nsidc-cumulus-prod-protected/ATLAS/{product}/{version}/' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This template looks to be hardcoded to the non-gridded ATLAS datasets like ATL06. Other gridded products (e.g. ATL14, ATL16, ATL20) would have a different template like s3://nsidc-cumulus-prod-protected/ATLAS/{product}/{version}/{year}/{filename}
' . E.g. looking at https://search.earthdata.nasa.gov/search?ff=Available%20in%20Earthdata%20Cloud&fi=ATLAS&gdf=HDF&fst0=Cryosphere&lat=65.08299573518599&long=-25.69921875&zoom=5:
Dataset | Sample path |
---|---|
ATL06 | s3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2023/04/16/ATL06_20230416235213_04061911_006_02.h5 |
ATL07 | s3://nsidc-cumulus-prod-protected/ATLAS/ATL07/005/2022/10/12/ATL07-02_20221012220720_03391701_005_01.h5 |
ATL10 | s3://nsidc-cumulus-prod-protected/ATLAS/ATL10/005/2022/10/12/ATL10-01_20221012220720_03391701_005_01.h5 |
ATL11 | s3://nsidc-cumulus-prod-protected/ATLAS/ATL11/005/2022/03/27/ATL11_006305_0315_005_03.h5 |
ATL14 | s3://nsidc-cumulus-prod-protected/ATLAS/ATL14/002/2019/ATL14_IS_0314_100m_002_01.nc |
ATL16 | s3://nsidc-cumulus-prod-protected/ATLAS/ATL16/004/2022/ATL16_20220722003637_04601601_004_01.h5 |
ATL20 | s3://nsidc-cumulus-prod-protected/ATLAS/ATL20/003/2022/ATL20-01_20220901002201_10861601_003_01.h5 |
Would it be possible to generalize this code to both non-gridded and gridded products?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks for catching this @weiji14! I'll work on a fix and let you know when I'm ready for another review!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question for @JessicaS11 and @weiji14 -- What do we think of getting the version and product name from inside the file instead of parsing it from the filename? I've only checked a handful of products, but those fields seem to be available in top-level metadata in a consistent way. I've been trying to parse those things out of the filename, which is how I believe it is also done elsewhere in the module, but this limits the files icepyx can process to those named in a very specific way. If we grab product/version from inside the file we are able to process more files (ex. cloud icesat-2 files not in nsidc bucket, or local files that have had their name changed). Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized that the place I was thinking about this last was in the branch to remove intake from icepyx. I just pushed a WIP PR (#438) so there is a place to discuss questions. Hopefully whatever we decide there about accessing the product/version from that can be used for this PR later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The discussion summarized in #438 (comment) indicates our intention to move away from requiring the user provide the product as input (unless they are also feeding in a directory containing files from multiple products). This should address the template issues noted here.
Co-authored-by: Jessica Scheick <[email protected]>
Co-authored-by: Jessica Scheick <[email protected]>
Co-authored-by: Jessica Scheick <[email protected]>
Co-authored-by: Wei Ji <[email protected]>
Co-authored-by: Jessica Scheick <[email protected]>
This PR prompted two prior PRs: #444 and #451. I think those updates will are exciting for enabling the goal of this PR, allowing the Variables class to list s3 data variables. The changes in those two prior have made the approach to this goal is quite different than when this PR was opened (it was more complicated back then!). As a result I'm going to close this PR and open a new one to pursue adding s3 url reads to the Variables class. |
What was done
At the moment the
Variables
class.avail()
method would fail if given an s3 url. This PR expands the class to handle this case.How it was done
In addition to
file
andorder
a newvartype
,nsidc-s3
, was added to the class.nsidc
was specified (instead of justs3
) because extracting the product and version from the filepath relies on the nsidc naming structure. So, you couldn't just giveVariables
the s3 path to any IS2 file anywhere on AWS; it needs to be from thensidc-cumulus-prod-protected
bucket. If others have feedback on this decision I'm happy to hear. We could alternately try to extract the product and version from the metadata, which would remove this limitation.Within the
.avail()
method thensidc-s3
filetype uses the same variable extraction strategy asorder
. An interesting note about this is that if we use theorder
method of grabbing variables it happens very quickly. It's the time to ping an API endpoint and parse the request. The.avail()
function gets very slow with cloud data when trying to walk the full file (as happens when using thefile
argument), but we can avoid that by using theorder
method of extracting variables instead.Todo / Next steps
This PR is pretty close, but blocked by determining how to pass authentication (#433). After auth is addressed this PR can be merged into the
getvars
branch, where work has already started changing the way theVariables
class is called inRead
andQuery
.parse
as a library requirement